Skip to content

fix: WebRTC signaling state guards — ICE restart, renegotiation, and glare (WT-986)#1003

Draft
SERDUN wants to merge 19 commits intodevelopfrom
fix/wt-986-ice-restart-and-renegotiation-guard
Draft

fix: WebRTC signaling state guards — ICE restart, renegotiation, and glare (WT-986)#1003
SERDUN wants to merge 19 commits intodevelopfrom
fix/wt-986-ice-restart-and-renegotiation-guard

Conversation

@SERDUN
Copy link
Member

@SERDUN SERDUN commented Mar 17, 2026

Summary

Four related fixes for WebRTC `RTCPeerConnection` signaling state violations that caused call drops or silent media failures:

1. ICE restart guard (`__onPeerConnectionEventIceConnectionStateChanged`)

  • Outer guard — skips ICE restart if `signalingState != stable` before calling `restartIce()`
  • Inner TOCTOU guard — re-checks state after `await createOffer()` before calling `setLocalDescription`; skips and logs a warning if state changed in the async gap
  • Moves `UpdateRequest` send inside the inner guard so it is never sent without a valid local description

2. Renegotiation guard (`RenegotiationHandler` — extracted from `_handleRenegotiationNeeded`)

  • Replaces the old `pcState != null` check with `pcState == RTCSignalingStateStable`
  • Adds the same TOCTOU re-check after `createOffer()`
  • Prevents `setLocalDescription` from being called in `have-remote-offer` or other non-stable states
  • Extracted into a standalone `RenegotiationHandler` class for testability

3. Glare condition fix (`__onCallSignalingEventUpdating`)

  • Detects when a remote `updating_call` offer arrives while the local PC is in `have-local-offer` state (both peers triggered renegotiation simultaneously, e.g. both enabled video at the same time)
  • Optimistic pre-check + authoritative `on String catch` fallback to handle stale Dart-side `signalingState` cache in `flutter_webrtc`
  • Rolls back the pending local offer via `setLocalDescription(RTCSessionDescription('', 'rollback'))` to return to stable
  • libwebrtc re-fires `onRenegotiationNeeded` once the connection returns to stable

4. Post-glare race fixes (this commit)

Two residual races that prevented video from appearing after the glare rollback:

`RenegotiationHandler` TOCTOU race with `__onCallSignalingEventUpdating`: After glare rollback, `onRenegotiationNeeded` re-fires and `RenegotiationHandler` starts a new offer cycle. However, `__onCallSignalingEventUpdating` concurrently calls `setRemoteDescription`, which moves the PC out of stable between the TOCTOU guard and `setLocalDescription`. Previously this escalated to `callErrorReporter` (user-visible error + call drop). Now caught as a transient race and logged at WARNING only — libwebrtc keeps `[[NegotiationNeeded]]` set and will re-fire `onRenegotiationNeeded` once the PC returns to stable, letting `RenegotiationHandler` succeed on the next attempt.

Server 448 on the glare answer: When both phones simultaneously enable video, each phone's rolled-back offer may still be pending on the server. The server rejects the answer `UpdateRequest` with `448 "SDP type answer is incompatible with session status incall"` because it is still processing the other phone's pending offer. Previously propagated through `catch (e, s)` and called `_ResetStateEvent.completeCall`, dropping the call. Now caught specifically as `WebtritSignalingErrorException` and logged as non-fatal — the PC local exchange is already complete, media continues to flow, and `RenegotiationHandler` re-offers on the next `onRenegotiationNeeded` once the server settles back to incall state.

Root cause (glare)

Although Janus is used as a media server, the WebTrit Core signaling layer forwards `UpdateRequest` SDP between peers without serialising simultaneous renegotiations. This means glare (both peers sending offers at the same time) is possible and must be handled client-side.

Test plan

  • Make two calls (audio only), then have both parties enable video simultaneously — call must not drop and video must appear on both sides
  • Airplane-mode toggle and network switch — ICE restart must succeed (no call drop)
  • Single-party video enable — normal renegotiation must still complete
  • Outgoing call with video from start — unaffected path must still work
  • Incoming call — unaffected path must still work

SERDUN added 14 commits March 17, 2026 12:19
Check RTCSignalingStateStable before creating offer and re-check after
the async createOffer gap to prevent TOCTOU race. Remove outdated TODO.
When both peers enable video simultaneously, both fire onRenegotiationNeeded
while in stable state, creating a race where each sends a local offer before
receiving the other's offer. This leaves both PCs in have-local-offer when
__onCallSignalingEventUpdating fires, causing setRemoteDescription to fail
with "Called in wrong state: have-local-offer" and dropping the call.

Fix: check signalingState before setRemoteDescription in the updating handler.
If have-local-offer, rollback the pending local offer first (returning to stable),
then proceed normally. libwebrtc will re-fire onRenegotiationNeeded once the
connection returns to stable after the answer completes.
@SERDUN SERDUN added the draft Not ready but can be start to review label Mar 17, 2026
SERDUN added 5 commits March 17, 2026 14:21
flutter_webrtc caches signalingState on the Dart side and updates it only
when the onSignalingState callback fires, not when setLocalDescription
completes. The pre-check could therefore read a stale 'stable' value even
though the native PC was already in have-local-offer, causing
setRemoteDescription to fail unhandled.

Add a try-catch on setRemoteDescription that catches the 'have-local-offer'
string error thrown by flutter_webrtc, rolls back the local offer, and
retries. The optimistic pre-check is kept as a fast path for the case where
the Dart-side cache is already up to date.
…otiation

Two separate races during simultaneous video-enable from both peers (glare):

1. RenegotiationHandler: setLocalDescription(offer) fails with "wrong state"
   when a concurrent __onCallSignalingEventUpdating calls setRemoteDescription
   between the TOCTOU guard and setLocalDescription. Previously escalated to
   callErrorReporter (user-visible error). Now caught as a transient race and
   logged at WARNING only — libwebrtc keeps [[NegotiationNeeded]] set and will
   re-fire onRenegotiationNeeded once the PC returns to stable.

2. __onCallSignalingEventUpdating: UpdateRequest(answer) rejected by server
   with 448 "SDP type answer is incompatible with session status incall" during
   glare, because the server is still processing the phone's own pending offer.
   Previously propagated out of the inner try-catch and triggered
   _ResetStateEvent.completeCall, dropping the call. Now caught specifically as
   WebtritSignalingErrorException and logged as a non-fatal warning. The PC local
   offer/answer exchange is already complete, media continues to flow, and the
   pending [[NegotiationNeeded]] causes RenegotiationHandler to re-offer once the
   server settles to incall state.
…rong state

During a glare race the local offer may be rolled back by
__onCallSignalingEventUpdating before the accepted event arrives.
If the PC is already in stable when the accepted answer jsep is
processed, setRemoteDescription(answer) would throw an uncaught
wrong-state exception and video would silently fail to establish.

Added a signalingState pre-check: if type is answer but PC is not in
have-local-offer, skip setRemoteDescription and log a warning.
RenegotiationHandler will fire a fresh offer on the next
onRenegotiationNeeded and the video exchange completes via that cycle.

Also wrapped setRemoteDescription in on String catch to prevent any
residual state races from propagating as unhandled exceptions.
…ess of server response

In the glare+448 scenario the server sends a CallErrorEvent (code 448) instead of
the expected `updated` event. Without this fix `updating: true` is never cleared
by __onCallSignalingEventUpdated, causing the video UI to stay hidden for the
duration of the call.

Clearing the flag unconditionally at the end of __onCallSignalingEventUpdating is
safe: __onCallSignalingEventUpdated setting `false` when already `false` is a no-op.
…re-fire

After a glare-resolution rollback, libwebrtc fires onAddTrack for the
incoming video track but does NOT re-fire onAddStream (onAddStream fires
only once per unique stream ID). The BLoC was relying solely on onAddStream
to update remoteStream, so the video overlay was never mounted and the
renderer was never attached to the stream.

Two complementary fixes:
1. remoteVideo now falls back to the logical `video` flag when the stream is
   absent or has no video tracks yet. This ensures RemoteVideoViewOverlay is
   mounted and the RTCVideoRenderer is attached to the stream — once attached,
   the renderer renders any video track that arrives natively, including tracks
   added after the initial onAddStream.
2. onAddTrack is now forwarded as a streamAdded event so the BLoC updates
   remoteStream when the video track arrives on a different stream object
   (covers cases where Janus creates a new stream for renegotiated video).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

draft Not ready but can be start to review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant